This is the notebook for the Module 4 Programming Assignment.
Here are a few tips for using the IPython HTML notebook:
You should rename this notebook to <your JHED id>_PR1.ipynb. Do it right now.
Make certain the entire notebook executes before you submit it. (See Hint #5 above)
Change the following variables:
In [1]:
name = "Shehzad Qureshi"
jhed_id = "squresh6"
if name == "Student Name" or jhed_id == "sname1":
    raise Exception("Change the name and/or JHED ID...preferably to yours.")
Add whatever additional imports you require here. Stick with the standard libraries and those required by the class. The import below gives you access to these functions: http://ipython.org/ipython-doc/stable/api/generated/IPython.core.display.html which, among other things, will permit you to display HTML as the result of evaluated code (see HTML() or display_html()).
In [2]:
from IPython.core.display import *
from StringIO import StringIO
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
This problem is very similar to the one in Module 1 that we solved with A* search. The main difference is that we now have stochastic movement (actions don't always result in the desired successor state) so we can't use A* search to solve it. Instead you're going to use Q-Learning (reinforcement learning).
As before, we're going to simplify the problem by working in a grid world. The symbols that form the grid have a special meaning as they specify the type of the terrain and the cost to enter a grid cell with that type of terrain:
token   terrain   cost
.       plains    1
*       forest    3
^       hills     5
~       swamp     7
When you go from a plains node to a forest node it costs 3. When you go from a forest node to a plains node, it costs 1. You can think of the grid as a big graph. Each grid cell (terrain symbol) is a node and there are edges to the north, south, east and west (except at the edges). Unlike A* Search, Reinforcement Learning will find a policy for the entire state space. The goal is (26, 26) with a reward of 100.0.
To avoid global variables, we have a read_world() function that takes a filename and returns the world as a list of lists. The same coordinate reversal as before applies: (x, y) is world[y][x].
In [3]:
def read_world(filename):
    # Read the raw lines; the with-block closes the file automatically.
    with open(filename, 'r') as f:
        world_data = f.readlines()
    world = []
    for line in world_data:
        line = line.strip()
        if line == "": continue
        world.append([x for x in line])
    return world
Next we create a dict of movement costs. Note that we've negated them this time because reinforcement learning works in terms of rewards: terrain costs become negative rewards, while the goal carries a positive reward:
In [4]:
# do not reference this as a global variable.
costs = { '.': -1, '*': -3, '^': -5, '~': -7}
costs
Out[4]:
and a list of movement offsets, NEIGHBORS. You'll need to pass this in as your actions parameter, A.
In [5]:
NEIGHBORS = [(0,-1), (1,0), (0,1), (-1,0)]
The transition function, T, is 0.70 for the desired direction, and 0.10 each for the other possible directions. That is, if I select "up" then 70% of the time, I go up but 10% of the time I go left, 10% of the time I go right and 10% of the time I go down. If you're at the edge of the map, you simply bounce back to the current state.
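As a concrete illustration of how such a transition model might be sampled (the assignment's own version appears later in execute_action; the helper name sample_move below is just illustrative and not part of the assignment API):

# Minimal sketch: sample the move actually taken for a chosen action index,
# where the intended direction has probability 0.7 and each of the other
# three directions has probability 0.1.
def sample_move(action_index):
    p = [0.7 if i == action_index else 0.1 for i in range(len(NEIGHBORS))]
    return NEIGHBORS[np.random.choice(len(NEIGHBORS), p=p)]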
You need to implement q_learning() with the following parameters: the world (the list of lists from read_world), the costs dict, the goal state, the goal reward, the list of actions, the discount factor gamma, and the learning rate alpha.
You will return a policy: {(x1, y1): action1, (x2, y2): action2, ...} Remember, a policy is what to do in any/every state. Notice how this is different from A* search, which only returns one path from start to goal. For a 2d navigation problem, you can also print out a map of what action to do in every state (see the sketch below).
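For example, here is a minimal sketch of such a text-based policy map, assuming the actions stored in the policy are the (dx, dy) offsets from NEIGHBORS (print_policy is a hypothetical helper name, not part of the assignment API):

# Minimal sketch: print the policy as a grid of arrows, one row per y value.
# 'G' marks the goal cell.
def print_policy(policy, width, height, goal):
    arrows = {(0, -1): '^', (1, 0): '>', (0, 1): 'v', (-1, 0): '<'}
    for y in range(height):
        row = ''
        for x in range(width):
            row += 'G' if (x, y) == goal else arrows[policy[(x, y)]]
        print(row)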
Use a goal that is the lower right hand corner of the world.
There are a lot of details that I have left up to you. Watch and re-watch the lecture on Q-Learning. Ask questions. You need to implement a way to pick initial states for each iteration and you need a way to balance exploration and exploitation while learning. You may have to experiment with different gamma and alpha values.
Remember that you should follow the general style guidelines for this course: well-named, short, focused functions with limited indentation using Markdown documentation that explains their implementation and the AI concepts behind them.
This assignment sometimes wreaks havoc with the IPython notebook. Create a small test world (5x6) to work with. Pick an asymmetric world to avoid edge cases.
In [6]:
debug = False        # verbose logging of each step
iter_max = 10000     # maximum steps per episode before giving up
episodes = 10000     # maximum number of episodes
epsilon = 0.15       # exploration rate for epsilon-greedy action selection
omega = 0.01         # convergence threshold on the change in Q over an episode
Ne = 5               # visit count below which an action is always explored
Pe = 10              # episodes per unit of distance the start state moves away from the goal
Now we need a way to initialize our Q matrix. This matrix will be a function of the size of the world and the number of actions per state. We also seed the goal state with the reward value provided.
In [7]:
def initialize_q_matrix(world, actions, goal, reward):
    # One Q value per (y, x, action) triple; everything starts at zero.
    xlen, ylen, alen = len(world[0]), len(world), len(actions)
    Q = np.zeros((ylen, xlen, alen))
    # Seed the goal state with the terminal reward for every action.
    x, y = goal
    Q[y][x] = reward
    return Q
Next we need a function to pick an initial state. We shall try to start closer to the goal and work our way further and further away.
In [8]:
def initialize_state(world, goal, episode):
    # Start near the goal and move the candidate row/column farther away
    # as the episode count grows.
    cx, cy = int(goal[0] - (float(episode)/Pe)), int(goal[1] - (float(episode)/Pe))
    if cx < 0: cx = 0
    if cy < 0: cy = 0
    bx, by = len(world[0]), len(world)
    positions = []
    positions.extend([(i,cy) for i in xrange(cx, bx) if world[cy][i] in costs])
    positions.extend([(cx,i) for i in xrange(cy, by) if world[i][cx] in costs])
    if goal in positions: positions.remove(goal)
    state = positions[np.random.randint(0, len(positions))]
    if debug: print("Initialized state to {0}".format(state))
    return state
We also need a method to pick an action once a state has been selected. This combines two strategies: any action tried fewer than Ne times in the current state is chosen first (at random among such actions); otherwise we fall back to epsilon-greedy selection.
In [9]:
def select_action(Q, N, s):
    x, y = s
    # Prefer actions tried fewer than Ne times in this state.
    n = [i for i, v in enumerate(N[y][x] < Ne) if v]
    if len(n) > 0: action = np.random.choice(n)
    else:
        # Epsilon-greedy: explore with probability epsilon, otherwise exploit.
        e = np.random.random()
        if e < epsilon: action = np.random.randint(0, len(Q[y][x]))
        else: action = np.argmax(Q[y][x])
    N[y][x][action] = N[y][x][action] + 1
    if debug: print(" Selecting {0} from {1} for {2}".format(action, Q[y][x], (x,y)))
    return action
Every action needs to be simulated with the transition model given above. The agent has no knowledge of this transition model. We also do bounds checking for each action and every out-of-bounds move will result in staying in the same state but will yield a larger penalty.
In [10]:
def execute_action(world, costs, state, action):
    # Transition model: 0.7 for the intended direction, 0.1 for each other direction.
    p = [0.7 if x == action else 0.1 for x in xrange(len(NEIGHBORS))]
    if debug: print(" Attempting to move {0} from {1} with p={2}".format(NEIGHBORS[action], state, p))
    move = NEIGHBORS[np.random.choice(xrange(len(NEIGHBORS)), p=p)]
    bound_x, bound_y = len(world[0]) - 1, len(world) - 1
    x, y = state[0] + move[0], state[1] + move[1]
    if x < 0 or x > bound_x or y < 0 or y > bound_y or world[y][x] not in costs:
        # Out of bounds or impassable: stay put and take a large penalty.
        new_state, reward = state, -100
    else:
        new_state = (x, y)
        reward = collect_reward(world, costs, new_state)
    if debug: print(" Executing {0} from {1} to {2}".format(move, state, new_state))
    return new_state, reward
Every action results in a reward, which here is simply the (negative) terrain cost of the cell entered.
In [11]:
def collect_reward(world, costs, new_state):
    x, y = new_state
    reward = costs[world[y][x]]
    if debug: print(" Collecting reward of {0} from {1}".format(reward, new_state))
    return reward
This method will convert our final Q matrix into an actual policy. The policy is a dictionary of each available state mapped to the action with the highest value.
In [12]:
def get_policy(Q):
    policy = dict()
    y, x, a = Q.shape
    for i in xrange(y):
        for j in xrange(x):
            policy[(j, i)] = NEIGHBORS[np.argmax(Q[i][j])]
    return policy
The Q-Learning function is defined below.
In [13]:
def q_learning(world, costs, goal_state, goal_reward, actions, gamma, alpha):
    Q = initialize_q_matrix(world, actions, goal_state, goal_reward)
    N = np.zeros_like(Q)   # visit counts used for exploration
    for i in xrange(episodes):
        s, epcopy, iter_count = initialize_state(world, goal_state, i + 1), np.copy(Q), 0
        while not s == goal_state:
            a, (sx,sy) = select_action(Q, N, s), s
            (nx,ny), reward = execute_action(world, costs, s, a)
            # Standard Q-learning update.
            Q[sy][sx][a] = (1-alpha)*(Q[sy][sx][a]) + (alpha*(reward+(gamma*np.max(Q[ny][nx]))))
            s, iter_count = (nx, ny), iter_count + 1
            if iter_count > iter_max: break
        # Stop when an entire episode changes Q by less than omega.
        if np.max(Q - epcopy) < omega: break
    return get_policy(Q)
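The update applied on every step of the inner loop is the standard Q-learning rule, written here for reference:

$$Q(s, a) \leftarrow (1 - \alpha)\,Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') \right]$$

where s' is the observed successor state and r is the reward returned by execute_action.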
We need a way to visualize our results, so we shall use matplotlib once again and borrow Problem 1's definitions. The resulting figure will have two plots side by side, one for the map and one for the policy. Overlaying the policy on top of the map did not seem visually appealing.
In [14]:
def setup_map(ax, world):
    # Map each terrain symbol to a matplotlib marker and alpha value.
    c = dict()
    c['.'] = ('.', 0.4)
    c['x'] = ('x', 0.7)
    c['^'] = ('^', 0.4)
    c['*'] = ('*', 0.4)
    c['~'] = ('d', 0.4)
    for y, _y in enumerate(world):
        for x, _x in enumerate(_y):
            params = c[world[y][x]]
            ax.plot(x, y, params[0], color='black', alpha=params[1])
This will draw our policy on the given axis.
In [15]:
def overlay_policy(ax, goal, policy, world):
    # Arrow symbols in the same order as NEIGHBORS: up, right, down, left.
    movement_map = ["^", ">", "v", "<"]
    for x, y in policy:
        if (x, y) == goal: symbol = "H"
        else: symbol = movement_map[NEIGHBORS.index(policy[(x,y)])] if world[y][x] in costs else "x"
        ax.plot(x, y, symbol, color='blue', alpha=0.8)
And now our main plotting function. We must remember to invert the y-axis after plotting the data.
In [16]:
def plot_policy(world, goal, policy):
    fig, (ax_map, ax_pol) = plt.subplots(1, 2)
    ax_map.set_axis_off()
    ax_pol.set_axis_off()
    ax_map.set_xlim((-1, len(world[0])))
    ax_map.set_ylim((-1, len(world)))
    ax_pol.set_xlim((-1, len(world[0])))
    ax_pol.set_ylim((-1, len(world)))
    setup_map(ax_map, world)
    overlay_policy(ax_pol, goal, policy, world)
    ax_map.invert_yaxis()
    ax_pol.invert_yaxis()
    plt.show()
Test on a smaller map first.
In [17]:
world = read_world("test_world.txt")
goal = (6,6)
%time policy = q_learning(world, costs, goal, 200.0, NEIGHBORS, 0.75, 0.25)
plot_policy(world, goal, policy)
I had to play with the discount factor and rewards for a bit to get a policy that I thought was good enough. Due to the randomness in the algorithm (selecting the initial state, selecting actions, simulating the transition model), no two runs are exactly alike.
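If reproducible runs were needed, one option (not used here) would be to seed NumPy's random number generator before calling q_learning, for example:

np.random.seed(0)  # any fixed seed makes the random choices repeatable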
And now we'll test it with our final world model from problem 1.
In [18]:
world = read_world("world.txt")
goal = (26, 26)
Ne = 5
epsilon = 0.15
Pe = 500
episodes = 1000000
iter_max = 1000000
%time policy = q_learning(world, costs, goal, 500.0, NEIGHBORS, 0.8, 0.25)
plot_policy(world, goal, policy)
I don't think this is the perfect policy, but I'm not sure there is one. I do notice that there may be loops if the agent starts on certain squares. I wonder if this has anything to do with the simulation of the transition model.
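One quick way to check for such loops would be a small diagnostic like the sketch below (follows_policy_to_goal is my own helper, not part of the assignment): it follows the learned policy deterministically from a start cell and reports whether it reaches the goal without revisiting a cell.

# Diagnostic sketch: walk the greedy policy from `start` and report whether
# it reaches `goal` within max_steps. Revisiting a cell means the policy loops.
def follows_policy_to_goal(policy, start, goal, max_steps=1000):
    state, visited = start, set()
    for _ in range(max_steps):
        if state == goal:
            return True
        if state in visited or state not in policy:
            return False
        visited.add(state)
        dx, dy = policy[state]
        state = (state[0] + dx, state[1] + dy)
    return False

For example, follows_policy_to_goal(policy, (0, 0), goal) would report whether the policy leads from the upper-left corner to the goal.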
Summary
Reinforcement Learning is a fascinating area of AI with many applications, including robotics, a topic I am very fond of. I can see how this algorithm could be used to get a robot to learn about its environment even when it initially knows almost nothing about it.
This isn't a trivial algorithm to understand, though. The outcome is highly dependent on the input parameters (gamma, alpha, epsilon, and the rewards) and on the structure of the algorithm itself:
In [19]:
# my testing